Instructions

  1. Study Sections {7.6, 8.1, 10.5, 11.2} – “Multicollinearity and Polynomial Models.”

  2. Attempt and submit at least {50} Hard Work Points by Saturday at 11:59 PM.
    Over {65} gets you {+1} Final Exam Point.
    Over {80} gets you {+2} Final Exam Points.

Reading Points {23} Possible

Section 7.6 {7}{ 6 / 7 }

Section 8.1 {7}{ 7 / 7 }

Section 10.5 {4}{ 2 / 2 }

Section 11.2 {5}{ 1 / 2 }

Theory Points {4} Possible

Problem 7.23 {2}{ 2 / 2 }

The high coefficient of multiple determination implies that they have a good fit. Multicollinearity generally does not have much of an impact on the ability to find a good fit.

However, it still makes it difficult to interpret the effects of each individual variable. This is because adding one correlated variable can drastically change the coefficient of another variable.
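A tiny illustration of that instability (the numbers are made up for the example): `y` is fit on `x1` alone, then again after adding a second predictor `x2` that is nearly collinear with `x1`. Both fits describe the data about equally well, but the coefficient on `x1` changes dramatically.

```r
x1 <- c(1, 2, 3, 4)
x2 <- c(1, 2, 3, 5)            # nearly collinear with x1 (cor is about 0.98)
y  <- c(1, 2, 3, 4.5)

coef(lm(y ~ x1))[["x1"]]       # slope on x1 alone: 1.15
coef(lm(y ~ x1 + x2))[["x1"]]  # slope on x1 after adding x2: 0.5
```

Only the attribution of the effect between the two correlated predictors changes; the fitted values barely move.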

Problem 8.3 {2}{ 2 / 2 }

It can be difficult to say without seeing the actual data, but I would suppose that the analyst did over-fit the data. Consider that the sample size is only 7: a high-order polynomial can easily be fitted to pass through each point.

A linear or quadratic model might do just as good a job and would most likely validate much better than the high-order polynomial.

The adjusted R2 would probably be somewhat lower than R2. This would indicate that as more parameters are added, the ability to explain the data stays about the same.
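The relationship is \(R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p}\), so with only n = 7 observations the penalty for extra parameters is steep. A quick sketch (the R-squared value of 0.99 is made up for illustration):

```r
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p)

# Same R^2 = 0.99 and n = 7, but more parameters drag the adjusted value down:
adj_r2(0.99, n = 7, p = 2)  # simple linear model: 0.988
adj_r2(0.99, n = 7, p = 5)  # 4th-degree polynomial: 0.97
```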

Application Points {35} Possible

Data Files

Problem 8.1 {4}{ 4 / 4 }

Prepare a contour plot of the response function \(E \{ Y \} = 140+4x_1^2 - 2x_2^2 + 5x_1x_2\).

Below is the contour plot. Negative expected values are plotted in blue and positive expected values in green.

x<-seq(-10,10,length=100)
y<-seq(-10,10,length=100)
z<-outer(x,y,function(x,y) 140+(4*x^2)-(2*y^2)+(5*x*y) )
contour(x,y,z,levels=c(50, 100, 125, 135, 150, 175, 200), xlab="X1", ylab="X2", main="Contour Plot of 140+4X1^2 - 2X2^2 + 5X1*X2", col="green")
contour(x,y,z,levels=c(0, -50, -100, -125, -135), add=TRUE, col="blue")

For the sake of learning more code, here is also a 3D plot of the same surface.

library(plotly)  # plot_ly() is from the plotly package

x <- seq(-10, 10, length = 100)
y <- seq(-10, 10, length = 100)
z <- outer(x, y, function(x, y) 140 + (4*x^2) - (2*y^2) + (5*x*y))

# For a surface plot, z should be the matrix of heights; x and y supply the axes.
p <- plot_ly(x = x, y = y, z = z, type = "surface")
p

Problem 8.4 {8}{ 7 / 8 }

# Read in the data for this problem (Problem 8.4 uses the CH01PR27 data set from the Kutner files):
p8.4 <- read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%201%20Data%20Sets/CH01PR27.txt", header=FALSE)

# Give it convenient column names:
colnames(p8.4) <- c("Y","X1")

a.

Fit the data to the model

\[ Y_i = \beta_0 + \beta_1x_i + \beta_{11}x_i^2 + \epsilon_i, \ \ \text{where} \ x_i = X_i - \bar{X}. \]

p8.4.lm <- lm(Y ~ I(X1-mean(p8.4$X1))  + I((X1-mean(p8.4$X1))^2), data=p8.4)
summary(p8.4.lm)
## 
## Call:
## lm(formula = Y ~ I(X1 - mean(p8.4$X1)) + I((X1 - mean(p8.4$X1))^2), 
##     data = p8.4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.086  -6.154  -1.088   6.220  20.578 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               82.935749   1.543146  53.745   <2e-16 ***
## I(X1 - mean(p8.4$X1))     -1.183958   0.088633 -13.358   <2e-16 ***
## I((X1 - mean(p8.4$X1))^2)  0.014840   0.008357   1.776   0.0811 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.026 on 57 degrees of freedom
## Multiple R-squared:  0.7632, Adjusted R-squared:  0.7549 
## F-statistic: 91.84 on 2 and 57 DF,  p-value: < 2.2e-16
b<-p8.4.lm$coefficients
plot(Y ~ I(X1-mean(p8.4$X1)), data=p8.4, main="Plotted with centered X1")
curve(b[1] + b[2]*x + b[3]*x^2, add=TRUE)

plot(Y ~ X1, data=p8.4, main="Plotted in original units")
curve(b[1] - b[2]*mean(p8.4$X1) + b[3]*mean(p8.4$X1)^2 + (b[2]-2*b[3]*mean(p8.4$X1))*x + b[3]*x^2, add=TRUE)

R2 is .7632, which is decent but not great. A straight line might do just as well as the quadratic, but the quadratic appears to fit well. Variance might be a bit of an issue.

b.

To test for a regression relation (see page 226):

qf(1-.05, 2, 57)
## [1] 3.158843

\(H_0: \beta_1 = \beta_{11} = 0\)

\(H_a: \text{at least one} \ \beta_k \ne 0\)

If \(F^* \le 3.158843\), conclude the null. Otherwise conclude the alternative.

Since \(F^* = 91.84 > 3.158843\), we conclude the alternative: there is a regression relation.
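Equivalently, one can compare the p-value of the observed statistic to the significance level instead of comparing \(F^*\) to the critical value:

```r
# P(F > 91.84) under H0, with (2, 57) degrees of freedom
pf(91.84, 2, 57, lower.tail = FALSE)  # far below 0.05, so reject H0
```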

c.

predict(p8.4.lm, data.frame(X1=48), interval="confidence")
##        fit      lwr      upr
## 1 99.25461 96.28436 102.2249

d.

predict(p8.4.lm, data.frame(X1=48), interval="prediction")
##        fit     lwr      upr
## 1 99.25461 82.9116 115.5976
plot(Y ~ X1, data=p8.4, main="Confidence (blue) and prediction limits (red) at X1=48")
curve(b[1] - b[2]*mean(p8.4$X1) + b[3]*mean(p8.4$X1)^2 + (b[2]-2*b[3]*mean(p8.4$X1))*x + b[3]*x^2, add=TRUE)
abline(h=96.28436, col="blue")
abline(h=102.2249, col="blue")
abline(h=82.9116, col="firebrick")
abline(h=115.5976, col="firebrick")
abline(v=48)

e.

\[ H_0: \beta_{11} = 0 \\ H_a: \beta_{11} \ne 0 \]

qt(1-.05/2, 57)
## [1] 2.002465

If \(|t^*| < 2.002465\) conclude the null, otherwise conclude the alternative.

Since \(|t^*| = 1.776 < 2.002465\), we conclude the null. The quadratic term is not needed in the model.

f.

In the original units the fitted regression becomes

\[ \hat{Y} = b'_0 + b_1'X + b'_{11}X^2, \ \ \text{where:} \\ b_0' = b_0 - b_1\bar{X} + b_{11}\bar{X}^2 \\ b_1' = b_1 - 2b_{11}\bar{X} \\ b'_{11} = b_{11} \]

With values for b the function becomes

\[ \hat{Y} = 207.35 - 2.9643X + 0.0148X^2 \]
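These conversion formulas are easy to get wrong by hand, so here is a small helper that applies them, with a self-check that the converted polynomial in \(X\) agrees with the centered polynomial in \(x = X - \bar{X}\). The value of `xbar` below is an assumption (roughly what `mean(p8.4$X1)` must be given the numbers above); with the actual data loaded you would pass `mean(p8.4$X1)`.

```r
# Convert (b0, b1, b11) from the centered fit to original-units coefficients.
to_original <- function(b, xbar) {
  c(b[1] - b[2]*xbar + b[3]*xbar^2,   # b0'
    b[2] - 2*b[3]*xbar,               # b1'
    b[3])                             # b11'
}

b    <- c(82.935749, -1.183958, 0.014840)  # estimates from the summary above
xbar <- 60                                 # assumed value of mean(p8.4$X1)
bp   <- to_original(b, xbar)

# Identity check: both parameterizations give the same fitted curve.
X <- seq(40, 80, by = 1)
stopifnot(all.equal(bp[1] + bp[2]*X + bp[3]*X^2,
                    b[1] + b[2]*(X - xbar) + b[3]*(X - xbar)^2))
```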

g.

Correlation between X and X^2.

cor(p8.4$X1, (p8.4$X1)^2)
## [1] 0.9960939

Correlation between x and x^2.

cor((p8.4$X1-mean(p8.4$X1)), (p8.4$X1-mean(p8.4$X1))^2)
## [1] -0.03835694

Using centered variables greatly reduces the correlation and thereby the multicollinearity.

Problem 8.5 {5}{ 4 / 5 }

a.

plot(p8.4.lm$residuals ~ I(X1-mean(p8.4$X1)), data=p8.4)

plot(p8.4.lm, which=1:2)

The residuals are fairly scattered; there doesn’t appear to be much of a pattern anywhere. The variance is slightly off, but it shouldn’t be too much of a concern.

b.

length(unique(p8.4$X1))
## [1] 32

The quadratic model has \(p = 3\) parameters, so the degrees of freedom are \(c - p = 32 - 3 = 29\) for lack of fit and \(n - c = 60 - 32 = 28\) for pure error.

qf(1-.05, 32-3, 60-32)

If \(F^*\) does not exceed this critical value, we conclude the null; otherwise we conclude the alternative.

pureErrorAnova(p8.4.lm)  # pureErrorAnova() is from the alr3 package
## Analysis of Variance Table
## 
## Response: Y
##                           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## I(X1 - mean(p8.4$X1))      1 11627.5 11627.5 176.0153 1.346e-13 ***
## I((X1 - mean(p8.4$X1))^2)  1   203.1   203.1   3.0750   0.09045 .  
## Residuals                 57  3671.3    64.4                       
##  Lack of fit              29  1821.6    62.8   0.9509   0.55387    
##  Pure Error               28  1849.7    66.1                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since \(F^* = 0.9509\) is below the critical value (equivalently, the lack-of-fit p-value of 0.55387 exceeds 0.05), we conclude the null: the regression model is a good fit for the data.

The assumptions for this test are that the values of Y for a given X are (1) independent and (2) normally distributed, and that (3) the distributions of Y have the same variance.

Assumptions (1) and (3) should be met, but (2) could be questionable.

c.

p8.4.lm2 <- lm(Y ~ I(X1-mean(p8.4$X1))  + I((X1-mean(p8.4$X1))^2) + I((X1-mean(p8.4$X1))^3), data=p8.4)
summary(p8.4.lm2)
## 
## Call:
## lm(formula = Y ~ I(X1 - mean(p8.4$X1)) + I((X1 - mean(p8.4$X1))^2) + 
##     I((X1 - mean(p8.4$X1))^3), data = p8.4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3671  -5.8483  -0.6755   6.1376  20.0637 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               82.9273444  1.5552264  53.322  < 2e-16 ***
## I(X1 - mean(p8.4$X1))     -1.2678894  0.2489231  -5.093 4.28e-06 ***
## I((X1 - mean(p8.4$X1))^2)  0.0150390  0.0084390   1.782   0.0802 .  
## I((X1 - mean(p8.4$X1))^3)  0.0003369  0.0009327   0.361   0.7193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.087 on 56 degrees of freedom
## Multiple R-squared:  0.7637, Adjusted R-squared:  0.7511 
## F-statistic: 60.34 on 3 and 56 DF,  p-value: < 2.2e-16
b2<-p8.4.lm2$coefficients
plot(Y ~ I(X1-mean(p8.4$X1)), data=p8.4, main="Plotted with centered X1")
curve(b2[1] + b2[2]*x + b2[3]*x^2 + b2[4]*x^3, add=TRUE)

\[ H_0: \beta_{111} = 0 \\ H_a: \beta_{111} \ne 0 \]

qt(1-.05/2, 56)  # df = n - p = 60 - 4 for the cubic model

If \(|t^*|\) is less than this critical value, conclude the null; otherwise conclude the alternative.

Since \(|t^*| = 0.361\) (two-sided p-value 0.7193) is well below the critical value, we conclude the null. The cubic term is not needed in the model.

This is consistent with our answer in part (b): the simpler model fit the data just as well as this model, so the third-order term is not needed.

Problem 8.6 {8}{ 7 / 8}

# Read in the data for this problem (Problem 8.6 uses the CH08PR06 data set from the Kutner files):
p8.6 <- read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%208%20Data%20Sets/CH08PR06.txt", header=FALSE)

# Give it convenient column names:
colnames(p8.6) <- c("Y","X1")

a.

Fit the data to the model

\[ Y_i = \beta_0 + \beta_1x_i + \beta_{11}x_i^2 + \epsilon_i, \ \ \text{where} \ x_i = X_i - \bar{X}. \]

p8.6.lm <- lm(Y ~ I(X1-mean(X1))  + I((X1-mean(X1))^2), data=p8.6)
summary(p8.6.lm)
## 
## Call:
## lm(formula = Y ~ I(X1 - mean(X1)) + I((X1 - mean(X1))^2), data = p8.6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5463 -2.5369  0.3868  2.1973  5.3020 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          21.09416    0.91415  23.075  < 2e-16 ***
## I(X1 - mean(X1))      1.13736    0.11546   9.851 6.59e-10 ***
## I((X1 - mean(X1))^2) -0.11840    0.02347  -5.045 3.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.153 on 24 degrees of freedom
## Multiple R-squared:  0.8143, Adjusted R-squared:  0.7989 
## F-statistic: 52.63 on 2 and 24 DF,  p-value: 1.678e-09
b<-p8.6.lm$coefficients
plot(Y ~ I(X1-mean(p8.6$X1)), data=p8.6, main="Plotted with centered X1")
curve(b[1] + b[2]*x + b[3]*x^2, add=TRUE, col="red")

plot(Y ~ X1, data=p8.6, main="Plotted in original units")
curve(b[1] - b[2]*mean(p8.6$X1) + b[3]*mean(p8.6$X1)^2 + (b[2]-2*b[3]*mean(p8.6$X1))*x + b[3]*x^2, add=TRUE)

R2 is .8143, which is reasonably good.

b.

To test for a regression relation (see page 226):

qf(1-.01, 2, 24)
## [1] 5.613591

\(H_0: \beta_1 = \beta_{11} = 0\)

\(H_a: \text{at least one} \ \beta_k \ne 0\)

If \(F^* \le 5.613591\), conclude the null. Otherwise conclude the alternative.

Since \(F^* = 52.63 > 5.613591\), we conclude the alternative: there is a regression relation.

c.

MSE <- 3.153^2  # residual standard error from the summary above, squared

xb <- mean(p8.6$X1)
X  <- cbind(1, p8.6$X1-xb, (p8.6$X1-xb)^2)  # design matrix, including the intercept column
Xh <- c(1, 10-xb, (10-xb)^2)
s2 <- MSE * (t(Xh) %*% solve(t(X)%*%X) %*% Xh)
s  <- sqrt(s2)
tcrit <- qt(1-.1/6, 24)  # Bonferroni: three intervals at family level 90%; df = n - p = 27 - 3 = 24

predict(p8.6.lm, data.frame(X1=10)) - tcrit*s
predict(p8.6.lm, data.frame(X1=10)) + tcrit*s
predict(p8.6.lm, data.frame(X1=c(10,15,20)), interval="confidence", level=1-0.1/3)
##        fit      lwr      upr
## 1 12.44734 10.57845 14.31624
## 2 21.09416 19.03025 23.15807
## 3 23.82092 21.88340 25.75843

d.

predict(p8.6.lm, data.frame(X1=15), interval="prediction", level = .99)
##        fit      lwr      upr
## 1 21.09416 11.91318 30.27514

e.

\[ H_0: \beta_{11} = 0 \\ H_a: \beta_{11} \ne 0 \]

qt(1-.01/2, 24)
## [1] 2.79694

If \(|t^*| < 2.79694\) conclude the null, otherwise conclude the alternative.

Since \(|t^*| = 5.045 > 2.79694\), we conclude the alternative. The quadratic term cannot be dropped from the model.

f.

In the original units the fitted regression becomes

\[ \hat{Y} = b'_0 + b_1'X + b'_{11}X^2, \ \ \text{where:} \\ b_0' = b_0 - b_1\bar{X} + b_{11}\bar{X}^2 \\ b_1' = b_1 - 2b_{11}\bar{X} \\ b'_{11} = b_{11} \]

With values for b the function becomes

\[ \hat{Y} = -26.3254 + 4.87357X - 0.1184X^2 \]

It isn’t required, but for the sake of curiosity, here is how centering the variable reduced the correlation.

Correlation between X and X^2.

cor(p8.6$X1, (p8.6$X1)^2)
## [1] 0.9894303

Correlation between x and x^2.

cor((p8.6$X1-mean(p8.6$X1)), (p8.6$X1-mean(p8.6$X1))^2)
## [1] 0.228599

Using centered variables greatly reduces the correlation and thereby the multicollinearity.

Problem Births78 {10}{ 8 / 10}

Find the best regression model that uses dayofyear to predict births.

plot(births ~ dayofyear, data=Births78, col=Births78$wday, pch=16)
legend("topleft", legend=levels(Births78$wday), lty=1, lwd=5, cex=0.7)

Saturdays and Sundays are clearly lower than the weekdays, with Sundays appearing to be lower than Saturdays.

mean(Births78$dayofyear)
## [1] 183
birthlm <- lm(births ~ I(dayofyear-183) + I((dayofyear-183)^2) + I((dayofyear-183)^3) + I((dayofyear-183)^4) + I((dayofyear-183)^5) + wday, data=Births78)
summary(birthlm)
## 
## Call:
## lm(formula = births ~ I(dayofyear - 183) + I((dayofyear - 183)^2) + 
##     I((dayofyear - 183)^3) + I((dayofyear - 183)^4) + I((dayofyear - 
##     183)^5) + wday, data = Births78)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1727.06  -122.57    14.85   171.24   845.54 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.271e+03  2.928e+01 316.592  < 2e-16 ***
## I(dayofyear - 183)      1.376e+01  6.485e-01  21.222  < 2e-16 ***
## I((dayofyear - 183)^2) -1.556e-02  5.506e-03  -2.826 0.004981 ** 
## I((dayofyear - 183)^3) -1.221e-03  7.650e-05 -15.955  < 2e-16 ***
## I((dayofyear - 183)^4)  1.688e-07  1.848e-07   0.913 0.361856    
## I((dayofyear - 183)^5)  2.730e-08  2.016e-09  13.544  < 2e-16 ***
## wday.L                  2.439e+02  4.127e+01   5.908 8.14e-09 ***
## wday.Q                 -1.561e+03  4.127e+01 -37.823  < 2e-16 ***
## wday.C                  1.375e+02  4.132e+01   3.329 0.000965 ***
## wday^4                 -6.480e+02  4.135e+01 -15.670  < 2e-16 ***
## wday^5                 -1.939e+02  4.137e+01  -4.688 3.95e-06 ***
## wday^6                  6.448e+00  4.137e+01   0.156 0.876245    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 298.4 on 353 degrees of freedom
## Multiple R-squared:  0.8709, Adjusted R-squared:  0.8669 
## F-statistic: 216.6 on 11 and 353 DF,  p-value: < 2.2e-16
b<-birthlm$coefficients
plot(births ~ I(dayofyear-183), data=Births78, col=Births78$wday, pch=16)
curve(b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[7]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[8]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[9]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[10]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[11]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[12]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)

Next, a simpler model that replaces the seven-level day-of-week factor with a single weekend indicator (weekend, a variable added to a copy of the data called BirthNew):

birthlm2 <- lm(births ~ I(dayofyear-183) + I((dayofyear-183)^2) + I((dayofyear-183)^3) + I((dayofyear-183)^4) + I((dayofyear-183)^5) + weekend, data=BirthNew)
summary(birthlm2)
## 
## Call:
## lm(formula = births ~ I(dayofyear - 183) + I((dayofyear - 183)^2) + 
##     I((dayofyear - 183)^3) + I((dayofyear - 183)^4) + I((dayofyear - 
##     183)^5) + weekend, data = BirthNew)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1889.10  -169.35    18.74   178.76  1018.90 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.673e+03  3.382e+01 285.997   <2e-16 ***
## I(dayofyear - 183)      1.380e+01  7.089e-01  19.468   <2e-16 ***
## I((dayofyear - 183)^2) -1.517e-02  6.018e-03  -2.521   0.0121 *  
## I((dayofyear - 183)^3) -1.227e-03  8.361e-05 -14.678   <2e-16 ***
## I((dayofyear - 183)^4)  1.514e-07  2.020e-07   0.749   0.4542    
## I((dayofyear - 183)^5)  2.752e-08  2.203e-09  12.494   <2e-16 ***
## weekend1               -1.413e+03  3.772e+01 -37.455   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326.1 on 358 degrees of freedom
## Multiple R-squared:  0.8436, Adjusted R-squared:  0.841 
## F-statistic: 321.9 on 6 and 358 DF,  p-value: < 2.2e-16
b<-birthlm2$coefficients
plot(births ~ I(dayofyear-183), data=Births78, col=Births78$wday, pch=16)
curve(b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)
curve(b[1] +b[7]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5, add=TRUE)

As a middle ground, sday is a three-level factor distinguishing weekdays (coded 0), Saturdays (coded 1), and Sundays (coded 2):

birthlm3 <- lm(births ~ I(dayofyear-183) + I((dayofyear-183)^2) + I((dayofyear-183)^3) + I((dayofyear-183)^4) + I((dayofyear-183)^5) + sday, data=BirthNew)
summary(birthlm3)
## 
## Call:
## lm(formula = births ~ I(dayofyear - 183) + I((dayofyear - 183)^2) + 
##     I((dayofyear - 183)^3) + I((dayofyear - 183)^4) + I((dayofyear - 
##     183)^5) + sday, data = BirthNew)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1889.36  -148.13    12.04   166.45  1018.34 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.674e+03  3.244e+01 298.201  < 2e-16 ***
## I(dayofyear - 183)      1.378e+01  6.799e-01  20.263  < 2e-16 ***
## I((dayofyear - 183)^2) -1.555e-02  5.772e-03  -2.694  0.00739 ** 
## I((dayofyear - 183)^3) -1.223e-03  8.019e-05 -15.251  < 2e-16 ***
## I((dayofyear - 183)^4)  1.684e-07  1.938e-07   0.869  0.38536    
## I((dayofyear - 183)^5)  2.738e-08  2.113e-09  12.959  < 2e-16 ***
## sday1                  -1.238e+03  4.753e+01 -26.046  < 2e-16 ***
## sday2                  -1.584e+03  4.716e+01 -33.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312.8 on 357 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8537 
## F-statistic: 304.5 on 7 and 357 DF,  p-value: < 2.2e-16
b<-birthlm3$coefficients

ggplot(data = BirthNew, aes(x = as.numeric(I(dayofyear-183)), y = births)) + 
  geom_point()  + 
  aes(colour = wday) + 
  labs(title = "Centered Variable") +
  stat_function(fun=function(x) (b[1]+b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5), col="black", lwd=1) +
  stat_function(fun=function(x) (b[1]+ b[7]+ b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5), col="maroon1", lwd=1) +
  stat_function(fun=function(x) (b[1]+b[8]+b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*x^4 + b[6]*x^5), col="lightcoral", lwd=1)

The current model is

\[ Y_i = \beta_0 + \beta_1x_{i1} + \beta_{11}x_{i1}^2 + \beta_{111}x_{i1}^3 + \beta_{1111}x_{i1}^4 + \beta_{11111}x_{i1}^5 + \beta_2 X_{i2} + \beta_3 X_{i3} + \epsilon_i, \\ \text{where} \ x_{i1} = X_{i1} - \bar{X}_1, \\ X_{i2}=1 \ \text{if the day is a Saturday (coded as 1), otherwise} \ X_{i2} =0, \ \text{and} \\ X_{i3}=1 \ \text{if the day is a Sunday (coded as 2), otherwise} \ X_{i3} =0. \]

To test for a regression relation:

qf(1-.05, 7, 357)
## [1] 2.035252

\(H_0: \beta_1 = \beta_{11} = \beta_{111} = \beta_{1111} = \beta_{11111} = \beta_2 = \beta_3 = 0\)

\(H_a: \text{at least one} \ \beta_k \ne 0\)

If \(F^* \le 2.035252\), conclude the null. Otherwise conclude the alternative.

Since \(F^* = 304.5 > 2.035252\), we conclude the alternative: there is a regression relation.

We therefore keep the model. (The individual t-tests above suggest the fourth-order term could be dropped, but every other term is significant.)

Confidence intervals for weekdays, for dayofyear equal to 120 and 230:

predict(birthlm3, data.frame(dayofyear=c(120,230), sday="0"), interval="confidence", level=1-0.05/2)
##         fit       lwr       upr
## 1  9025.523  8946.187  9104.859
## 2 10167.177 10087.116 10247.237

Confidence intervals for Saturdays, for dayofyear equal to 120 and 230:

predict(birthlm3, data.frame(dayofyear=c(120,230), sday="1"), interval="confidence", level=1-0.05/2)
##        fit      lwr      upr
## 1 7787.451 7669.135 7905.766
## 2 8929.105 8810.994 9047.215

Confidence intervals for Sundays, for dayofyear equal to 120 and 230:

predict(birthlm3, data.frame(dayofyear=c(120,230), sday="2"), interval="confidence", level=1-0.05/2)
##        fit      lwr      upr
## 1 7441.033 7323.362 7558.705
## 2 8582.687 8464.873 8700.502

It isn’t required, but for the sake of curiosity, here is how centering the variable reduced the correlation.

Correlation between dayofyear and dayofyear^2.

cor(BirthNew$dayofyear, (BirthNew$dayofyear)^2)
## [1] 0.9684119

Correlation between x and x^2.

cor((BirthNew$dayofyear-mean(BirthNew$dayofyear)), (BirthNew$dayofyear-mean(BirthNew$dayofyear))^2)
## [1] 0

Using centered variables greatly reduces the correlation and thereby the multicollinearity.
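The exact zero here is no accident: dayofyear runs over the full range 1 to 365, so the centered values are symmetric about 0, the odd moment \(\sum x^3\) vanishes, and the correlation between \(x\) and \(x^2\) is zero (up to floating point). A quick check on the range alone:

```r
x  <- 1:365
xc <- x - mean(x)   # centered; symmetric about 0 since mean(1:365) = 183

sum(xc^3)           # odd moment cancels pairwise
cor(xc, xc^2)       # hence essentially zero correlation
```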

The same pattern appears in every case: centering the predictor removes most of the correlation between the linear and higher-order terms.